feat(integrations): add support for the litellm responses/aresponses APIs #6205

constantinius wants to merge 2 commits into master.

Conversation
Codecov Results 📊
✅ 2187 passed | ⏭️ 154 skipped | Total: 2341 | Pass Rate: 93.42% | Execution Time: 4m 55s
All tests are passing successfully.
❌ Patch coverage is 0.00%. Project has 12726 uncovered lines. Files with missing lines (2)
Generated by Codecov Action
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 9f4b78a.
constantinius force-pushed the branch from 9f4b78a to bb31cad (Compare)
```python
    _input_callback(kwargs)
    _success_callback(
        kwargs, MockResponsesResponse(), datetime.now(), datetime.now()
    )
```
With SDK tests we aim to verify that we generate telemetry based on the user's interaction with the library. We want to assert the presence of telemetry when the patched library is used the way a user would use it.
Currently, we assert that telemetry is generated when _input_callback and _success_callback are each invoked exactly once.
That is not always the case, and this assumption has already resulted in unhandled SDK exceptions, which were fixed in the commit below:
```python
class MockResponsesUsage:
    def __init__(self, input_tokens=12, output_tokens=24, total_tokens=36):
        self.input_tokens = input_tokens
        self.output_tokens = output_tokens
        self.total_tokens = total_tokens


class MockResponsesContentItem:
    def __init__(self, text):
        self.type = "output_text"
        self.text = text


class MockResponsesOutputMessage:
    def __init__(self, text):
        self.type = "message"
        self.role = "assistant"
        self.content = [MockResponsesContentItem(text)]


class MockResponsesResponse:
    def __init__(
        self,
        model="gpt-4.1-nano",
        output=None,
        usage=None,
    ):
        self.id = "resp-test"
        self.model = model
        self.output = output or [MockResponsesOutputMessage("the model response")]
        self.usage = usage or MockResponsesUsage()
```
Related to https://github.com/getsentry/sentry-python/pull/6205/changes#r3201008608, we should aim to avoid custom types in our test suites.
As soon as we introduce custom types, our tests are no longer coupled to the concrete types used in the library, and they no longer verify the SDK contract (namely, that telemetry is generated when the library is used the way a user would use it).
We can't hit real LLM APIs in the tests, but we can do the next best thing: couple the sample response to the types in the library and patch at the lowest possible level.
This is done in most of the tests in this test file, and there are helpers in the repo (such as get_model_response()) that make writing effective tests easier.
```diff
     if hasattr(response, "usage"):
         usage = response.usage
         record_token_usage(
             span,
-            input_tokens=getattr(usage, "prompt_tokens", None),
-            output_tokens=getattr(usage, "completion_tokens", None),
-            total_tokens=getattr(usage, "total_tokens", None),
+            input_tokens=_read_usage_field(usage, "prompt_tokens", "input_tokens"),
+            output_tokens=_read_usage_field(
+                usage, "completion_tokens", "output_tokens"
+            ),
+            total_tokens=_read_usage_field(usage, "total_tokens"),
         )
```
We already probe above to determine which API is used.
As a result, reading prompt_tokens or input_tokens as a fallback is dead code once you know which API you are handling, and it adds cognitive overhead when reading.
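A minimal sketch of the suggestion, where `is_responses_api` is a hypothetical flag standing in for whatever the probe above determines:

```python
def read_usage(usage, is_responses_api):
    # Hypothetical sketch: `is_responses_api` represents the result of the API
    # probe done earlier in the integration, so only one schema is read here.
    if is_responses_api:
        # Responses API usage objects expose input_tokens / output_tokens.
        return usage.input_tokens, usage.output_tokens, usage.total_tokens
    # Chat Completions usage objects expose prompt_tokens / completion_tokens.
    return usage.prompt_tokens, usage.completion_tokens, usage.total_tokens
```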
```python
                set_data_normalized(
                    span, SPANDATA.GEN_AI_RESPONSE_TEXT, response_messages
                )
        elif hasattr(response, "output"):
```
You are adding code here that runs for every type of object that has an output field.
As a result, the branch can easily be triggered accidentally as litellm evolves. There are multiple ways to narrow down whether you have a response in the Chat Completions API schema or in the Responses API schema. For example, based on the signature of the library function, you can check
isinstance(response, (ResponsesAPIResponse, BaseResponsesAPIStreamingIterator)).
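A rough sketch of that narrowing; the import paths below are assumptions about where litellm defines these types and should be verified against the installed version:

```python
from litellm.responses.streaming_iterator import BaseResponsesAPIStreamingIterator
from litellm.types.llms.openai import ResponsesAPIResponse


def is_responses_api_result(response):
    # Narrow on the concrete return types of litellm.responses() instead of
    # probing for an `output` attribute that other objects might also have.
    return isinstance(
        response, (ResponsesAPIResponse, BaseResponsesAPIStreamingIterator)
    )
```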
```python
        normalized = normalize_message_roles(input_messages)  # type: ignore[arg-type]
        messages_data = truncate_and_annotate_messages(normalized, span, scope)
        if messages_data is not None:
            set_data_normalized(
```
Based on the marshaling above you know that messages_data is a list. You should just use span.set_data() when you know the type of an attribute (again, removing cognitive overhead by avoiding dead code).
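For illustration, the suggestion applied to the quoted lines (SPANDATA.GEN_AI_REQUEST_MESSAGES is an assumption about which attribute the surrounding code writes to):

```python
        normalized = normalize_message_roles(input_messages)  # type: ignore[arg-type]
        messages_data = truncate_and_annotate_messages(normalized, span, scope)
        if messages_data is not None:
            # messages_data is already a list here, so write it directly
            # instead of running it through another normalization pass.
            span.set_data(SPANDATA.GEN_AI_REQUEST_MESSAGES, messages_data)
```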
```python
    The usage object can be either a typed Pydantic model (attribute access) or
    a plain dict (litellm hands us a dict for the assembled async-streaming
    response), so we try both shapes.
```
Why don't we just read from the dictionary in the asynchronous streaming scenario and otherwise access the attribute on the Pydantic model 😄?
These responses have types, so an isinstance check can tell you which branch you are in.
In the end we're developing against a library with a finite number of return types, and we should just check which case we are handling instead of probing around. Probing is less robust, since new return types can accidentally trigger hasattr() checks.
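A minimal sketch of that branch, assuming the two shapes described in the docstring above (a plain dict for the assembled async-streaming response, a typed model otherwise):

```python
def get_input_tokens(usage):
    # Sketch only: branch once on the concrete shape instead of probing both
    # shapes for every field.
    if isinstance(usage, dict):
        # litellm hands us a plain dict for the assembled async-streaming response.
        return usage.get("input_tokens")
    # Otherwise it is a typed (Pydantic) usage model with attribute access.
    return usage.input_tokens
```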
```python
                for content_item in getattr(output, "content", []) or []:
                    text = getattr(content_item, "text", None)
                    if text is not None:
                        output_text.append(text)
```
This has reached a lot of indentation for Python code. Usually you can keep code readable by adding early returns or breaking up into functions where appropriate.
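One way to flatten it is to pull the inner loops into a small helper (the name below is illustrative, not the integration's actual one):

```python
def collect_output_text(output_items):
    # Gathers the text parts of a Responses API output list; keeping this in a
    # helper keeps the span-recording code at a shallow indentation level.
    texts = []
    for output in output_items:
        for content_item in getattr(output, "content", []) or []:
            text = getattr(content_item, "text", None)
            if text is None:
                continue
            texts.append(text)
    return texts
```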

Description
Adds support for responses and aresponses, and their differences in output tracking. Also checks the conversation ID if it is passed in extra_args.
Contributes to https://linear.app/getsentry/issue/TET-2287/see-if-we-can-auto-extract-conversationid-from-openai-python
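As a rough illustration of the conversation-ID handling mentioned above (a hypothetical helper; the actual lookup and span attribute in the PR may differ):

```python
def extract_conversation_id(kwargs):
    # Hypothetical sketch: the ID may be passed directly as a kwarg or inside
    # the extra_args mapping, as described in the PR description.
    extra_args = kwargs.get("extra_args") or {}
    return kwargs.get("conversation_id") or extra_args.get("conversation_id")
```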